001 /* 002 * CrochemoreLandauZivUkelson.java 003 * 004 * Copyright 2003 Sergio Anibal de Carvalho Junior 005 * 006 * This file is part of NeoBio. 007 * 008 * NeoBio is free software; you can redistribute it and/or modify it under the terms of 009 * the GNU General Public License as published by the Free Software Foundation; either 010 * version 2 of the License, or (at your option) any later version. 011 * 012 * NeoBio is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; 013 * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR 014 * PURPOSE. See the GNU General Public License for more details. 015 * 016 * You should have received a copy of the GNU General Public License along with NeoBio; 017 * if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, 018 * Boston, MA 02111-1307, USA. 019 * 020 * Proper attribution of the author as the source of the software would be appreciated. 021 * 022 * Sergio Anibal de Carvalho Junior mailto:sergioanibaljr@users.sourceforge.net 023 * Department of Computer Science http://www.dcs.kcl.ac.uk 024 * King's College London, UK http://www.kcl.ac.uk 025 * 026 * Please visit http://neobio.sourceforge.net 027 * 028 * This project was supervised by Professor Maxime Crochemore. 029 * 030 */ 031 032 package neobio.alignment; 033 034 import java.io.Reader; 035 import java.io.IOException; 036 037 /** 038 * This abstract class is the superclass of both global and local sequence alignment 039 * algorithms (with linear gap penalty function) due to Maxime Crochemore, Gad Landau and 040 * Michal Ziv-Ukelson (2002). 041 * 042 * <P>This implementation derives from the paper of M.Crochemore, G.Landau and 043 * M.Ziv-Ukelson, <I>A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring 044 * Matrices</I> (available here as 045 * <A HREF="doc-files/Crochemore_Landau_Ziv-Ukelson_algorithm.pdf">PDF</A> or 046 * <A HREF="doc-files/Crochemore_Landau_Ziv-Ukelson_algorithm.pdf">Postscript</A>).</P> 047 * 048 * <P>It employs Lempel-Ziv compression techniques to speed up the classic dynamic 049 * programmin approach to sequence alignment (see {@linkplain NeedlemanWunsch} and 050 * {@linkplain SmithWaterman} classes). It reduces the time and space complexity from 051 * O(n<SUP>2</SUP>) to O(n<SUP>2</SUP>/log n). In fact, the complexity is actually O(h 052 * n<SUP>2</SUP>/log n), where 0 <= h <= 1 is a real number denoting the entropy of the 053 * text (a measure of how compressible it is). This means that, the more compressible a 054 * sequence is, the less memory the algorithm will require, and the faster it will 055 * run.</P> 056 * 057 * <P>The idea behind this improvement is to identify repetitions in the sequences and 058 * reuse the computation of their alignments. The first step is, therefore, to parse the 059 * sequences into LZ78-like factors. LZ78 is a popular compression algorithm of the 060 * Lempel-Ziv familiy due to J.Ziv and A.Lempel (1978). This factorisation is accomplished 061 * by the {@linkplain FactorSequence} class (for more information about this 062 * factorisation, please refer to the specification of that class) that builds a 063 * doubly-linked list of factors. Each factor is an instance of the {@linkplain Factor} 064 * class (refer to the specification of this class for more information).</P> 065 * 066 * <P>Once the sequences have been parsed, the algorithm builds a matrix of blocks, called 067 * block table, that is vaguely similar to the dynamic programming matrix used by both 068 * <CODE>NeedlemanWunsch</CODE> and <CODE>SmithWaterman</CODE>. Each block contains an 069 * instance of an {@linkplain AlignmentBlock} (please refer to its specification for more 070 * information on what information is stored) and represents the alignment beteween one 071 * factor of each sequence. This block table is, in fact, a partition of the alignment 072 * graph.</P> 073 * 074 * <P>Consider a block B which corresponds to the alignment of factor F1 = xa from 075 * sequence S1 and factor F2 = yb from sequence S2. Here, F1 extends a previous factor of 076 * S1 with character a, while F2 extends a previous factor of S2 with character b. We can 077 * define the input border of B as the set of values at the left and top borders of block 078 * B, and the output border as the set of values at the right and bottom borders of B. 079 * Moreover, we can define the following prefix blocks of B:</P> 080 * 081 * <UL> 082 * <LI><B>Left prefix</B> - is the block that contains the alignment of factor F1 with 083 * a factor F2' = y (a prefix of factor F2). 084 * <LI><B>Diagonal prefix</B> - is the block that contains the alignment of factor F1' = x 085 * (a prefix of factor F1) with a factor F2' = y (a prefix of factor F2). 086 * <LI><B>Top prefix</B> - is the block that contains the alignment of factor F1' = x (a 087 * prefix of factor F1) with factor F2. 088 * </UL> 089 * 090 * <P>Note that each factor has a pointer to its prefix factor, called ancestor (see the 091 * specification of the <CODE>Factor</CODE> class). This pointer makes it easy to retrieve 092 * any of the three prefix blocks of B in constant time.</P> 093 * 094 * <P>Rather than computing each value of the alignment block B, the algorithm will only 095 * compute the values on its input and output borders. This is precisely what makes it 096 * more efficient.</P> 097 * 098 * <P>In this class there is a general specification of how the block table is computed 099 * (see the {@link #computeBlockTable computeBlockTable} method for details). The actual 100 * method depends on the subclasses. In general, there are two phases:</P> 101 * 102 * <UL> 103 * <LI><B>Encoding</B> - the structure of a block B is studied and represented in an 104 * efficient way by computing weights of optimal paths connecting each entry of its input 105 * border to each entry of its output border. This information is encoded in a DIST matrix 106 * where DIST[i,j] stores the weight of an optimal paths connecting entry i of the input 107 * border to entry j the output border of B. 108 * <LI><B>Propagation</B> - the input border of a block B is retrieved (from the left and 109 * top blocks of B) and its output border is computed with the help of the DIST matrix. 110 * </UL> 111 * 112 * <P>In fact, for each block, only one column of the DIST matrix needs to be computed, 113 * all other columns are actually retrieved from its prefix blocks. This is precisely what 114 * is accomplished by the {@link #assembleDistMatrix assembleDistMatrix} method found in 115 * this class (it is general enough for both global and local alignment versions of the 116 * algorithm.</P> 117 * 118 * <P>From the DIST matrix, we obtain the OUT matrix defined as 119 * <CODE>OUT[i,j] = I[i] + DIST[i,j]</CODE> where I is the input border array. This means 120 * that the OUT matrix is the DIST matrix updated by the input border of a block. The 121 * output border is then computed from the OUT matrix by taking the maximum value of each 122 * column. This class also have a general method for assembling the input border (see 123 * {@link #assembleInputBorder assembleInputBorder}</P> 124 * 125 * <P>The OUT matrix is encoded by the {@linkplain OutMatrix} class that takes as 126 * both a DIST matrix and an input border array. Note that <B>it does not compute the OUT 127 * matrix</B>, it just stores the necessary information to retrieve a value at any 128 * position of the matrix.</P> 129 * 130 * <P>A naive approach to compute the output border of a block from the OUT matrix of size 131 * n x n would take a time proportional to n<SUP>2</SUP>. However, it happens that, due to 132 * the nature of the DIST matrix, both DIST and OUT matrices are Monge arrays, which 133 * implies that they are also <I>totally monotone</I>. This property allows the 134 * computation of the output border of B in linear time with the SMAWK algorithm (see the 135 * specification of the {@linkplain Smawk} class for more information on SMAWK).</P> 136 * 137 * <P>This class contains a general specification that is pertinent to both global and 138 * local versions of the algorithm. For more information on each version of, please refer 139 * to the appropriate subclass.</P> 140 * 141 * <P><B>A note about the performance of these algorithms.</B> Although theoretical 142 * results suggest that these algorithms are faster and consume less memory than the 143 * classical methods, in practice it is hard to realise their performance gains. 144 * 145 * <P>These algorithms are extremely complex and require the storage of many extra 146 * pointers and other auxiliary data for each block (see the <CODE>AlignmentBlock</CODE> 147 * class for more details). Hence, even though the space requirement is 148 * O(n<SUP>2</SUP>/log n), which is less than O(n<SUP>2</SUP>), in practice, for most of 149 * the cases these algorithms will take more time and memory space than their clasical 150 * counterparts (we have to keep in mind that the Big Oh notation ignores all constants 151 * involved).</P> 152 * 153 * <P>Therefore, in order to realise the full power of these algorithms, they have to be 154 * used with extremly large and redundant sequences. This will allow a proper 155 * reutilisation of the computations and, maybe, provide an improvement in terms of space 156 * and run time. For instance, it is easy to devise such a sequence if we use a 157 * one-character alphabet because, in this case, a sequence is factorised into a series 158 * of factors that are a prefix of the next.</P> 159 * 160 * @author Sergio A. de Carvalho Jr. 161 * @see CrochemoreLandauZivUkelsonGlobalAlignment 162 * @see CrochemoreLandauZivUkelsonLocalAlignment 163 * @see NeedlemanWunsch 164 * @see SmithWaterman 165 * @see FactorSequence 166 * @see AlignmentBlock 167 * @see OutMatrix 168 * @see Smawk 169 * @see #computeBlockTable 170 * @see #assembleDistMatrix 171 */ 172 public abstract class CrochemoreLandauZivUkelson extends PairwiseAlignmentAlgorithm 173 { 174 /** 175 * A constant that indicates that the source of an optimal path has been reached in a 176 * block and that the trace back procedure to retrieve a high scoring alignment can 177 * stop. 178 */ 179 protected static final byte STOP_DIRECTION = 0; 180 181 /** 182 * A constant that indicates that the left direction must be followed to reach the 183 * source of an optimal path in a block during the trace back procedure to retrieve a 184 * high scoring alignment. 185 */ 186 protected static final byte LEFT_DIRECTION = 1; 187 188 /** 189 * A constant that indicates that the diagonal direction must be followed to reach the 190 * source of an optimal path in a block during the trace back procedure to retrieve a 191 * high scoring alignment. 192 */ 193 protected static final byte DIAGONAL_DIRECTION = 2; 194 195 /** 196 * A constant that indicates that the top direction must be followed to reach the 197 * source of an optimal path in a block during the trace back procedure to retrieve a 198 * high scoring alignment. 199 */ 200 protected static final byte TOP_DIRECTION = 3; 201 202 /** 203 * The first factorised sequence being aligned. 204 */ 205 protected FactorSequence seq1; 206 207 /** 208 * The second factorised sequence being aligned. 209 */ 210 protected FactorSequence seq2; 211 212 /** 213 * The block table, which is a matrix of alignment blocks where each block represents 214 * the alignment between one factor of each sequence. 215 */ 216 protected AlignmentBlock[][] block_table; 217 218 /** 219 * Number of rows of the block table. It is determined by the number of factors of the 220 * first sequence. 221 */ 222 protected int num_rows; 223 224 /** 225 * Number of columns of the block table. It is determined by the number of factors of 226 * the second sequence. 227 */ 228 protected int num_cols; 229 230 /** 231 * An instance of the <CODE>Smawk</CODE> class that implements the SMAWK algorithm to 232 * compute the column maxima of a totally monotone matrix. It is used to speed up the 233 * computation of the output border of a block. 234 */ 235 protected Smawk smawk = new Smawk(); 236 237 /** 238 * An instance of the <CODE>OutMatrix</CODE> class that encodes the OUT matrix of a 239 * block when supplied with the DIST matrix and the input border array of a block. 240 * Note that it does not compute the OUT matrix itselft, it just stores the necessary 241 * information to retrieve a value at any position of the matrix. 242 * 243 * <P>This object is then used to compute the output border of a block with the 244 * <CODE>Smawk</CODE> class. Note that the <CODE>OutMatrix</CODE> class implements the 245 * <CODE>Matrix</CODE> interface as required by the <CODE>Smawk</CODE> class. 246 * 247 * @see Matrix 248 * @see Smawk 249 */ 250 protected OutMatrix out_matrix = new OutMatrix (); 251 252 /** 253 * Loads sequences into <CODE>FactorSequence</CODE> instances. In case of any error, 254 * an exception is raised by the constructor of <CODE>FactorSequence</CODE> (please 255 * check the specification of that class for specific requirements). 256 * 257 * <P>A <CODE>FactorSequence</CODE> is an LZ78-like factorisation of the sequences 258 * being aligned. 259 * 260 * @param input1 Input for first sequence 261 * @param input2 Input for second sequence 262 * @throws IOException If an I/O error occurs when reading the sequences 263 * @throws InvalidSequenceException If the sequences are not valid 264 * @see FactorSequence 265 */ 266 protected void loadSequencesInternal (Reader input1, Reader input2) 267 throws IOException, InvalidSequenceException 268 { 269 // load sequences into instances of CharSequence 270 this.seq1 = new FactorSequence(input1); 271 this.seq2 = new FactorSequence(input2); 272 273 // determine the block table's dimensions 274 this.num_rows = seq1.numFactors(); 275 this.num_cols = seq2.numFactors(); 276 } 277 278 /** 279 * Frees pointers to loaded sequences and the the block table so that their data can 280 * be garbage collected. 281 */ 282 protected void unloadSequencesInternal () 283 { 284 this.seq1 = null; 285 this.seq2 = null; 286 this.block_table = null; 287 } 288 289 /** 290 * Computes the block table (the result depends on subclasses, see 291 * <CODE>computeBlockTable</CODE> for details) and requests subclasses to retrieve an 292 * optimal alignment between the loaded sequences. The actual product depends on the 293 * subclasses which can produce a global (see 294 * <CODE>CrochemoreLandauZivUkelsonGlobalAlignment</CODE>) or local alignment (see 295 * <CODE>CrochemoreLandauZivUkelsonLocalAlignment</CODE>). 296 * 297 * <P>Subclasses are required to implement the <CODE>buildOptimalAlignment</CODE> 298 * abstract method defined by this class according to their own method.</P> 299 * 300 * @return an optimal alignment between the loaded sequences 301 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 302 * with the loaded sequences. 303 * @see CrochemoreLandauZivUkelsonGlobalAlignment 304 * @see CrochemoreLandauZivUkelsonLocalAlignment 305 * @see #computeBlockTable 306 * @see #buildOptimalAlignment 307 */ 308 protected PairwiseAlignment computePairwiseAlignment () 309 throws IncompatibleScoringSchemeException 310 { 311 // compute block table 312 computeBlockTable (); 313 314 // build and return an optimal global alignment 315 PairwiseAlignment alignment = buildOptimalAlignment (); 316 317 // allow the block table to be garbage collected 318 block_table = null; 319 320 return alignment; 321 } 322 323 /** 324 * Computes the block table (the result depends on subclasses, see 325 * <CODE>computeBlockTable</CODE> for details) and requests subclasses to locate the 326 * score of the highest scoring alignment between the two sequences in the block 327 * table. The result depends on the subclasses, and either a global alignment 328 * (see <CODE>CrochemoreLandauZivUkelsonGlobalAlignment</CODE>) or local alignment 329 * score (see <CODE>CrochemoreLandauZivUkelsonLocalAlignment</CODE>) will be produced. 330 * 331 * <P>Subclasses are required to implement the <CODE>locateScore</CODE> abstract 332 * method defined by this class according to their own method.</P> 333 * 334 * <P>Note that this method calculates the similarity value only (it doesn't trace 335 * back into the block table to retrieve the alignment itself).</P> 336 * 337 * @return the score of the highest scoring alignment between the loaded sequences 338 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 339 * with the loaded sequences. 340 * @see CrochemoreLandauZivUkelsonGlobalAlignment 341 * @see CrochemoreLandauZivUkelsonLocalAlignment 342 * @see #locateScore 343 */ 344 protected int computeScore () throws IncompatibleScoringSchemeException 345 { 346 // compute block table 347 computeBlockTable (); 348 349 // get score 350 int score = locateScore (); 351 352 // allow the block table to be garbage collected 353 block_table = null; 354 355 return score; 356 } 357 358 /** 359 * Computes the block table. This method is a general specification of how the block 360 * table should be computed. It creates the block table according to the number of 361 * factors of each sequence. It then goes over each position of the block table, 362 * retrieves the corresponding factors from each sequence, and repasses the 363 * information to the subclasses that will do the actual computation of each block 364 * using the scoring scheme previously set. 365 * 366 * <P>There are four different cases that defines four abstract methods in this class, 367 * which subclasses must implement:</P> 368 * 369 * <UL> 370 * <LI><B>createRootBlock</B> - creates and computes block at row 0, column 0; 371 * <LI><B>createFirstColumnBlock</B> - creates and computes blocks at column 0 which 372 * corresponds to alignment blocks between factors of sequence 1 and an empty string; 373 * <LI><B>createFirstRowBlock</B> - creates and computes blocks at row 0 which 374 * corresponds to alignment blocks between factors of sequence 2 and an empty string; 375 * <LI><B>createBlock</B> - creates and computes blocks at row > 0 and column > 0 376 * which corresponds to alignment blocks between one factor of sequence 1 and one 377 * factor of sequence 2; 378 * </UL> 379 * 380 * <P>Note that each factor has a serial number which indicates its order in the list 381 * of factors of a sequence. This number will match with the row and column index of 382 * a block in the block table. For instance, if a block has factors F1 and F2 with 383 * serial numbers 12 and 53, respectively, this means that this block is found at row 384 * 12, column 53 of the block table.</P> 385 * 386 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 387 * with the loaded sequences. 388 * @see #createRootBlock 389 * @see #createFirstColumnBlock 390 * @see #createFirstRowBlock 391 * @see #createBlock 392 */ 393 protected void computeBlockTable () throws IncompatibleScoringSchemeException 394 { 395 Factor factor1, factor2; 396 int r, c, max_length; 397 398 // create block table 399 block_table = new AlignmentBlock[num_rows][num_cols]; 400 401 // find the length of the longest sequence (number of characters) 402 max_length = Math.max(seq1.numChars(), seq2.numChars()); 403 404 // prepares the OUT matrix object 405 out_matrix.init (max_length, scoring.maxAbsoluteScore()); 406 407 // start at the root of each trie 408 factor1 = seq1.getRootFactor(); 409 factor2 = seq2.getRootFactor(); 410 411 // check if roots' indexes are both zero 412 if (factor1.getSerialNumber() != 0 || factor2.getSerialNumber() != 0) 413 throw new IndexOutOfBoundsException ("Unexpected factor index."); 414 415 // initiate first cell of block table 416 block_table[0][0] = createRootBlock (factor1, factor2); 417 418 // compute first row 419 for (c = 1; c < num_cols; c++) 420 { 421 factor2 = factor2.getNext(); 422 423 // check if factor2's index equals its column number (except for 424 // the last factor that can be a repetition of a previous one) 425 if (c < num_cols - 1 && factor2.getSerialNumber() != c) 426 throw new IndexOutOfBoundsException ("Unexpected factor index."); 427 428 block_table[0][c] = createFirstRowBlock (factor1, factor2, c); 429 } 430 431 for (r = 1; r < num_rows; r++) 432 { 433 factor1 = factor1.getNext(); 434 435 // check if factor1's index equals its row number (except for 436 // the last factor that can be a repetition of a previous one) 437 if (r < num_rows - 1 && factor1.getSerialNumber() != r) 438 throw new IndexOutOfBoundsException ("Unexpected factor index."); 439 440 // go back to the root of sequence 2 441 factor2 = seq2.getRootFactor(); 442 443 if (factor2.getSerialNumber() != 0) 444 throw new IndexOutOfBoundsException ("Unexpected factor index."); 445 446 // compute first column of current row 447 block_table[r][0] = createFirstColumnBlock (factor1, factor2, r); 448 449 for (c = 1; c < num_cols; c++) 450 { 451 factor2 = factor2.getNext(); 452 453 // check if factor2's index equals its column number (except for 454 // the last factor that can be a repetition of a previous one) 455 if (c < num_cols - 1 && factor2.getSerialNumber() != c) 456 throw new IndexOutOfBoundsException ("Unexpected factor index."); 457 458 // compute row r, col c 459 block_table[r][c] = createBlock (factor1, factor2, r, c); 460 } 461 } 462 } 463 464 /** 465 * Assembles the DIST matrix of a block by retrieving the DIST columns of its prefix 466 * blocks. In fact, it also stores pointers to the owner block for each column 467 * retrieved. These pointers are later used during the trace back procedure that 468 * builds an optimal alignment from the information computed in the block table. This 469 * method is general enough to suit both global and local alignment versions of the 470 * algorithm. 471 * 472 * @param block the block for which the DIST matrix is needed 473 * @param dim the dimension of the DIST matrix 474 * @param r the row index of this block in the block table 475 * @param c the column index of this block in the block table 476 * @param lc the number of columns of the alignment block 477 * @return the DIST matrix 478 */ 479 protected int[][] assembleDistMatrix (AlignmentBlock block, int dim, int r, int c, 480 int lc) 481 { 482 AlignmentBlock ancestor; 483 Factor parent; 484 int[][] dist; 485 int i; 486 487 dist = new int[dim][]; 488 489 // columns to the left of lc 490 parent = block.factor2.getAncestor(); 491 for (i = lc - 1; i >= 0; i--) 492 { 493 ancestor = block_table[r][parent.getSerialNumber()]; 494 block.ancestor[i] = ancestor; 495 dist[i] = ancestor.dist_column; 496 parent = parent.getAncestor(); 497 } 498 499 // column lc 500 dist[lc] = block.dist_column; 501 block.ancestor[lc] = block; 502 503 // columns to the right of lc 504 parent = block.factor1.getAncestor(); 505 for (i = lc + 1; i < dim; i++) 506 { 507 ancestor = block_table[parent.getSerialNumber()][c]; 508 block.ancestor[i] = ancestor; 509 dist[i] = ancestor.dist_column; 510 parent = parent.getAncestor(); 511 } 512 513 return dist; 514 } 515 516 /** 517 * Assembles the input border of a block by retrieving the values at the output 518 * borders of the left and top blocks. This method is general enough to suit both 519 * global and local alignment versions of the algorithm. Note that it can be used to 520 * assemble the input border of any block but the root one (it will cause an 521 * <CODE>ArrayIndexOutOfBoundsException</CODE>. 522 * 523 * @param dim dimension of the input border 524 * @param r row index of the block in the block table 525 * @param c column index of the block in the block table 526 * @param lr number of row of the block 527 * @return the block's input border 528 */ 529 protected int[] assembleInputBorder (int dim, int r, int c, int lr) 530 { 531 AlignmentBlock left = null, top = null; 532 int[] input; 533 int i; 534 535 input = new int [dim]; 536 537 // set up pointers to the left and top blocks (if applicable) 538 if (c > 0) left = block_table[r][c-1]; 539 if (r > 0) top = block_table[r-1][c]; 540 541 for (i = 0; i < dim; i++) 542 { 543 if (i < lr) 544 { 545 if (left != null) 546 input[i] = left.output_border[left.factor2.length() + i]; 547 else 548 // there is no block to the left, so set a big negative value 549 // to make sure it will not be used (unfortunately, MIN_VALUE 550 // can overflow to a positive value when substracted by any 551 // number, so we use half of it as a workaround) 552 input[i] = Integer.MIN_VALUE / 2; 553 } 554 else if (i == lr) 555 { 556 if (left != null) 557 input[i] = left.output_border[left.factor2.length() + i]; 558 else 559 // no need to check if top is not null 560 // (because we assume this is not the root block) 561 input[i] = top.output_border[i - lr]; 562 } 563 else 564 { 565 if (top != null) 566 input[i] = top.output_border[i - lr]; 567 else 568 // there is no top block (see note for the left case) 569 input[i] = Integer.MIN_VALUE / 2; 570 } 571 } 572 573 return input; 574 } 575 576 /** 577 * Traverses a block to retrieve a part of an optimal alignment from the specified 578 * source in the output border to an entry in the input border. 579 * 580 * @param block the block to be traversed 581 * @param source the source of the path in the output border 582 * @param gapped_seq1 the StringBuffer to where the gapped sequence 1 is written to 583 * @param tag_line the StringBuffer to where the tag_line is written to 584 * @param gapped_seq2 the StringBuffer to where the gapped sequence 2 is written to 585 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 586 * with the loaded sequences. 587 */ 588 protected void traverseBlock (AlignmentBlock block, int source, 589 StringBuffer gapped_seq1, StringBuffer tag_line, StringBuffer gapped_seq2) 590 throws IncompatibleScoringSchemeException 591 { 592 char char1, char2; 593 594 while (block.direction[source] != STOP_DIRECTION) 595 { 596 char1 = block.factor1.getNewChar(); 597 char2 = block.factor2.getNewChar(); 598 599 switch (block.direction[source]) 600 { 601 case LEFT_DIRECTION: 602 gapped_seq1.insert (0, GAP_CHARACTER); 603 tag_line.insert (0, GAP_TAG); 604 gapped_seq2.insert (0, char2); 605 606 block = getLeftPrefix (block); 607 break; 608 609 case DIAGONAL_DIRECTION: 610 gapped_seq1.insert (0, char1); 611 if (char1 == char2) 612 if (useMatchTag()) 613 tag_line.insert (0, MATCH_TAG); 614 else 615 tag_line.insert (0, char1); 616 else if (scoreSubstitution(char1, char2) > 0) 617 tag_line.insert (0, APPROXIMATE_MATCH_TAG); 618 else 619 tag_line.insert (0, MISMATCH_TAG); 620 gapped_seq2.insert (0, char2); 621 622 block = getDiagonalPrefix (block); 623 source --; 624 break; 625 626 case TOP_DIRECTION: 627 gapped_seq1.insert (0, char1); 628 tag_line.insert (0, GAP_TAG); 629 gapped_seq2.insert (0, GAP_CHARACTER); 630 631 block = getTopPrefix (block); 632 source --; 633 break; 634 } 635 } 636 } 637 638 /** 639 * This method is a shorthand to retrieve the left prefix of a block from the block 640 * table. 641 * 642 * @param block the block 643 * @return the block's left prefix 644 */ 645 protected AlignmentBlock getLeftPrefix (AlignmentBlock block) 646 { 647 int prefix_row = block.factor1.getSerialNumber(); 648 int prefix_col = block.factor2.getAncestorSerialNumber(); 649 650 return block_table[prefix_row][prefix_col]; 651 } 652 653 /** 654 * This method is a shorthand to retrieve the diagonal prefix of a block from the 655 * block table. 656 * 657 * @param block the block 658 * @return the block's diagonal prefix 659 */ 660 protected AlignmentBlock getDiagonalPrefix (AlignmentBlock block) 661 { 662 int prefix_row = block.factor1.getAncestorSerialNumber(); 663 int prefix_col = block.factor2.getAncestorSerialNumber(); 664 665 return block_table[prefix_row][prefix_col]; 666 } 667 668 /** 669 * This method is a shorthand to retrieve the top prefix of a block from the block 670 * table. 671 * 672 * @param block the block 673 * @return the block's top prefix 674 */ 675 protected AlignmentBlock getTopPrefix (AlignmentBlock block) 676 { 677 int prefix_row = block.factor1.getAncestorSerialNumber(); 678 int prefix_col = block.factor2.getSerialNumber(); 679 680 return block_table[prefix_row][prefix_col]; 681 } 682 683 /** 684 * Computes the root block of the block table. See subclasses for actual 685 * implementation. 686 * 687 * @param factor1 the factor of the first sequence being aligned 688 * @param factor2 the factor of the second sequence being aligned 689 * @return the root block 690 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 691 * with the loaded sequences. 692 */ 693 protected abstract AlignmentBlock createRootBlock (Factor factor1, Factor factor2) 694 throws IncompatibleScoringSchemeException; 695 696 /** 697 * Computes a block at the first row (row zero) of the block table, which corresponds 698 * to an alignment block between one factor of sequence 2 and an empty string. See 699 * subclasses for actual implementation. 700 * 701 * @param factor1 the factor of the first sequence being aligned 702 * @param factor2 the factor of the second sequence being aligned 703 * @param col the column index of block in the block table 704 * @return the computed block 705 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 706 * with the loaded sequences. 707 */ 708 protected abstract AlignmentBlock createFirstRowBlock (Factor factor1, Factor factor2, 709 int col) throws IncompatibleScoringSchemeException; 710 711 /** 712 * Computes a block at the first column (column zero) of the block table, which 713 * corresponds to an alignment block between one factor of sequence 1 and an empty 714 * string. See subclasses for actual implementation. 715 * 716 * @param factor1 the factor of the first sequence being aligned 717 * @param factor2 the factor of the second sequence being aligned 718 * @param row the row index of block in the block table 719 * @return the computed block 720 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 721 * with the loaded sequences. 722 */ 723 protected abstract AlignmentBlock createFirstColumnBlock (Factor factor1, 724 Factor factor2, int row) throws IncompatibleScoringSchemeException; 725 726 /** 727 * Computes a block of the block table, which corresponds to an alignment block 728 * between one factor of sequence 1 and one factor of sequence 2. See subclasses for 729 * actual implementation. 730 * 731 * @param factor1 the factor of the first sequence being aligned 732 * @param factor2 the factor of the second sequence being aligned 733 * @param row the row index of block in the block table 734 * @param col the column index of block in the block table 735 * @return the computed block 736 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 737 * with the loaded sequences. 738 */ 739 protected abstract AlignmentBlock createBlock (Factor factor1, Factor factor2, 740 int row, int col) throws IncompatibleScoringSchemeException; 741 742 /** 743 * Retrieves an optimal alignment between the loaded sequences. See subclasses for 744 * actual implementation. 745 * 746 * @return the computed block 747 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 748 * with the loaded sequences. 749 */ 750 protected abstract PairwiseAlignment buildOptimalAlignment () 751 throws IncompatibleScoringSchemeException; 752 753 /** 754 * Locates the score of the highest scoring alignment between the two sequences in the 755 * block table after is thas been computed. See subclasses for actual implementation. 756 * 757 * @return the score of the highest scoring alignment between the loaded sequences 758 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible 759 * with the loaded sequences. 760 */ 761 protected abstract int locateScore (); 762 }